Comprehensive full genome analysis of norovirus strains from eastern India, 2017–2021

Background Worldwide, noroviruses are the leading cause of acute gastroenteritis (AGE) in people of all age groups. In India, norovirus rates between 1.4 to 44.4% have been reported. Only a very few complete norovirus genome sequences from India have been reported. Objective To perform full genome sequencing of noroviruses circulating in India during 2017–2021, identify circulating genotypes, assess evolution including detection of recombination events. Methodology Forty-five archived norovirus-positive samples collected between October 2017 to July 2021 from patients with AGE from two hospitals in Kolkata, India were processed for full genome sequencing. Phylogenetic analysis, recombination breakpoint analysis and comprehensive mutation analysis were also performed. Results Full genome analysis of norovirus sequences revealed that strains belonging to genogroup (G)I were genotyped as GI.3[P13]. Among the different norovirus capsid-polymerase combinations, GII.3[P16], GII.4 Sydney[P16], GII.4 Sydney[P31], GII.13[P16], GII.16[P16] and GII.17 were identified. Phylogenetic analysis confirmed phylogenetic relatedness with previously reported norovirus strains and all viruses were analyzed by Simplot. GII[P16] viruses with multiple residue mutations within the non-structural region were detected among circulating GII.4 and GII.3 strains. Comprehensive mutation analysis and selection pressure analysis of GII[P16] viruses showed positive as well as negative selection sites. A GII.17 strain (NICED-BCH-11889) had an untypeable polymerase type, closely related to GII[P38]. Conclusion This study highlights the circulation of diverse norovirus strains in eastern India. These findings are important for understanding norovirus epidemiology in India and may have implications for future vaccine development. Supplementary Information The online version contains supplementary material available at 10.1186/s13099-023-00594-5.


Introduction
Noroviruses are a major cause of acute gastroenteritis (AGE) in people across all age groups.Worldwide, about 18% of AGE cases are associated with norovirus with an estimated 200,000 deaths annually [1,2].Norovirus outbreaks have a significant financial impact estimated to be more than $64 billion yearly [3,4].In spite of huge morbidity and mortality across the world, no antivirals or vaccines against norovirus have been licensed yet.
Noroviruses are genetically diverse group of viruses belonging to the family Caliciviridae.They contain a single stranded positive sense RNA genome of ∼ 7.6 kb long which is comprised of three open reading frames (ORFs).ORF1 encodes non-structural polyprotein which is proteolytically cleaved to produce six non-structural proteins including viral RNA dependent RNA polymerase (NS7).ORF2 and ORF3 encode major capsid protein (Viral Protein 1; VP1) and minor capsid protein (Viral protein 2; VP2) respectively [5].Based on amino acid diversity of VP1, ten norovirus genogroups (GI-GX) have been identified.Each genogroup comprises multiple genotypes.Humans can be infected by more than 36 different genotypes from 5 different genogroups (GI, GII, GIV, GVIII and GIX) [6].Of these, GII viruses are most frequently detected GII.4 viruses are the predominant genotype associated with 50-70% of all norovirus outbreaks [7].In addition to VP1 diversity, point mutations and recombination further contribute to the greater genetic diversity of noroviruses.One of the major recombination hotspots is located at the junction of ORF1 and ORF2, thus dual typing has become the standard method to type noroviruses [6].
Reported rates of norovirus infections in different parts of India vary widely.In northern and southern India, rates ranged between 1.4 and 44.4% [8,9], while in western India, a rate of 10.7% was reported [10,11].Previous studies from eastern India reported a prevalence of 3.1-6.0%among children with AGE [12][13][14][15][16].Most genotype reports are based on a frequently used small region of the genome and full genome norovirus sequences from India are lacking.In the present study, we performed full genome sequencing of noroviruses circulating during 2017-2021 to comprehensively characterize in eastern India.

Study site and ethical clearance
Kolkata is among the 30 megacities in the world and 3rd largest populous metropolitan in Eastern part of India.During this study, fecal specimens from people of all ages were collected from two tertiary care referral hospitals (Infectious Disease and Beliaghata General Hospital (IDH) and Dr. B.C. Roy Post-Graduate Institute of Pediatric Sciences (BCPGIPS)) (Additional file 6: Table S1).The study was approved by institutional ethics committees of participating institutes.

Sample curation and processing
A total of 45 archived norovirus-positive specimens collected between October 2017 to July 2021 were selected for whole genome sequencing (Additional file 6: Table S1).Stool specimens from patients with moderate AGE (Vesikari score; 8-10) and low Ct values (≤ 23) were included in the study.30% w/v stool suspensions were prepared with phosphate-buffered saline (PBS; pH 7.4) and clarified by centrifugation at 10,000g for 10 min.Viral RNA was isolated using Trizol LS Reagent (Life Technologies, Grand Island, NY, USA) and eluted with 50 µl nuclease free water and used for whole genome analysis by NGS.

cDNA library construction and illumina MiSeq sequencing
cDNA library preparation and Illumina MiSeq sequencing were carried out as described earlier [17].Briefly, a 200-bp fragment library ligated with bar-coded adapters was constructed for individual strains with a NEBNext ® Ultra ™ II RNA Library Prep Kit for Illumina ® (New England Biolabs, Ipswich, MA).Purification of the cDNA library was carried out with Agencourt AMPure XP magnetic beads (Beckman Coulter, Brea, CA).The quality assessment of the purified cDNA libraries was performed using a 4150 TapeStation (Agilent Technologies).About 151 paired end reads were generated by nucleotide sequencing by using an Illumina MiSeq sequencer (Illumina, San Francisco, CA) and a MiSeq Reagent Kit v2 (Illumina).CLC Genomics Workbench v7.0.3 (CLC Bio, Tokyo, Japan) was used to analyse the data.Contigs that shared a percent nucleotide identity of 95% or less were assembled from the obtained sequence reads by de novo assembly.The complete genome sequence of GII.17 strain (NICED-BCH-11889) was additionally determined by 5'-RACE [18].Near-full genome sequences were deposited in DDBJ under the following accession numbers: LC769681.1-LC769715.1.

Phylogenetic analysis
Multiple sequence alignments and phylogenetic analysis and were performed using MEGA 11 (Molecular Evolutionary Genetic Analysis ver 11.0.13)software.For the construction of phylogenetic tree best fitting substitution model was selected from model selection parameter in MEGA software.Substitution model GTR + G + I was chosen on the basis of lowest BIC and AIC(c) score.Pairwise sequence alignment and nucleotide or protein sequence identity were determined by using LALIGN tool (https:// www.ebi.ac.uk/ Tools/ psa/ lalign/) [21].

Detection of recombinant norovirus and analysis of recombination breakpoints
To identify break points in the genomes of the recombinant strains, the sequences were analysed using SimPlot 3.5.1 and Simplot + + software [22,23].The analysis was conducted using standard parameters of the program with a window size of 200 nt and a step size of 20 nt.

RNA secondary structure prediction
Conserved RNA secondary structures upstream of the sub genomic transcript region were predicted using RNAalifold web server and mFold (RNA Folding Form ver.2.3) software within the UNAFold web server.

Detection of adaptive evolution and amino acid logo analysis
The number of non-synonymous substitutions per non-synonymous site (dN) and synonymous substitutions per synonymous site (dS) as well as dN/dS ratio was calculated.To identify sites under selection pressure, we employed the maximum likelihood (ML) based Fixed Effects Likelihood method (FEL) method, Bayesian approach based Fast Unconstrained Bayesian Approximation method (FUBAR), and a combination of ML and counting approach based Single-Likelihood Ancestor Counting method(SLAC) in Datamonkey 2.0 webserver [24].P value < 0.1 was considered significant in SLAC and FEL method while posterior probabilities of > 0.9 was considered significant in FUBAR method.
Informative amino acid substitution has been identified by multiple protein sequence alignment tool in MEGA v11.0.13.The relative frequencies of amino acid occurrence (bits) of the proteins were visualized using WebLogo (ver.3.7.11).

Detection of recombination breakpoints
To characterise the potential recombination events of the norovirus strains identified in this study, a similarity plot analysis was performed by aligning the complete genomic sequences of the study strains with closely related strains (Fig. 3).Similarity plot analysis of GII.4 Sydney[P16] strain NICED-RV-515 revealed recombination break point at approximately nt 5040, which is Recombination breakpoints were analysed in the putative sub genomic transcription start region at NS/S junction (Fig. 4).In all GII.3[P16] and GII.13[P16] strains, recombination breakpoint was detected before or after the ORF2 start codon respectively (Fig. 4a, b).ln GII.4 Sydney[P16] and GII.4 Sydney[P31] strains probable recombination breakpoint was detected before ORF2 start codon but far away from sub genomic transcription start site (Fig. 4c, d).Interestingly, in all GII.4 Sydney[P16] strains probable recombination breakpoint was detected within the predicted secondary hairpin structure before the sub genomic transcription start region [26,27].

Analysis of predominantly circulating norovirus strains with GII.4 capsids
Alignment of GII.4 capsid protein sequences showed variations in the antigenic epitopes (A, C, D, E, F, G and I) in the P2 subdomain.Amino acid residues in the Epitope E, F and I were the most conserved, while variation in amino acid residues were detected in other antigenic epitopes (Fig. 5 and Additional file 8: Table S3) [28].Further analysis revealed alteration of amino acid sequences in the HBGA binding loops in the circulating GII.4 norovirus capsids (Additional file 9: Table S4).

Detection of adaptive evolution in circulating GII[P16]
The newly emerged GII[P16] polymerase types have acquired characteristics to combine with different capsid genotypes resulting in enhanced spread of this genotype globally.Analysis of amino acid sequences revealed several amino acid substitutions in the circulating ORF1 of GII.3[P16] and GII.4[P16] norovirus when compared to the reference GII[P16] strain (NC_039477.1)(Fig. 5).Several amino acid substitutions in the GII[P16] ORF1 of GII.3 viruses were different compared to GII[P16] ORF 1 of GII.4 viruses (Table 1 and Additional file 2: Fig. S3a-f ).SLAC, FEL and FUBAR method analysis of the non-structural genes within the ORF1 of GII[P16] P-types revealed a large number of pervasive purifying selection sites (Additional file 4: Fig. S4).Two pervasive positive selection sites within the NS1/2 gene (NS1/2: D51G/N and NS1/2: E79G/V) and one in NS7 gene (NS7: K405R) were detected by only FUBAR method (marked by red asterisk in Fig. 6).
Analysis of the histo-blood group antigen (HBGA) surface binding loops ( A-, B-, P-, S-, T-, U-and N-) in the Pdomain of VP1 capsid protein of NICED-BCH-11889 has identified novel changes in amino acid residues in the A-, B-, P-and T-loops compared to previously reported distinct GII.17 strains.Interestingly, a deletion of consecutive quadruple amino acid residues (aa343-aa346) in the P-loop of NICED-BCH-11889 strain was also noteworthy (Fig. 9).

Discussion
Genomic plasticity leading to emergence of diverse recombinant strains and limited cross-protection against viruses from different genotypes have made studies better understanding their evolution important.Although whole genome noroviruses from many different countries are available, only one such report is available from India [30].Our study provides a comprehensive analysis of norovirus among patients with acute diarrheal illness in eastern India.In contrast to other studies, where increased severity of diarrheal illness is associated with GII.4 viruses compared to non GII.4 viruses, no conclusive correlation between disease severity and individual genotypes was observed in our study [31][32][33].The limitation of our study is the relatively low number of samples (n = 45) that were studied.
Genetic diversity of noroviruses is shaped by evolution and recombination events primarily among viruses within the same genogroup [34].To date, of the four reported inter-genogroup recombinant norovirus strains, one strain (polymerase sequence of genogroup GI and capsid sequence of genogroup GII and genotype 4 (GII.4))was detected from Kolkata, India in 2006 [35,36].In the present study, all of the recombinants identified were inter-genotypic in nature.Most recombination events take place at the ORF1/ORF2 junction region leading to multiple combinations of capsid and Fig. 6 Analysis of informative amino acid substitutions in non-structural proteins of GII[P16] genotype associated with circulating GII.4 and GII.3 noroviruses.Amino acid substitution in circulating GII[P16] strains were compared with reference GII[P16] strain (NC_039477.1).Amino acid substitution with more than 5% frequencies were marked by green triangle and complete substitution sites were marked by blue star.Pervasive positive selection sites were indicated in red asterisk.(Details of this analysis are attached as Additional file 3: Fig. S3a-f ) Fig. 7 Phylogenetic analysis VP1 capsid and polymerase genes of a rare GII.17 norovirus strain NICED-BCH-11889.The closest similar polymerase strain Beijing53931/GII.25[P38](GQ856469.1)was marked in black circle and the study strain was indicated in black triangle polymerases, generating new recombinant strains that might have advancement in fitness, pathogenesis and/ or transmissibility compared to their ancestral strains [37,38].Changes in the VP1 enable norovirus to evade immunity developed against prior infection which has been proposed as the mechanism of changing GII.4 variant viruses between 2002 and 2015.[5].However, in recent years several genotypes have a novel ORF1 and polymerase types (e.g.GII.P16) [25].Emergence of a new GII[P16] ORF1 associated with multiple capsid genotypes (e.g., GII.4 Sydney[P16], GII.2[P16], GII.3[P16], GII.12[P16]) has been reported worldwide [39] and were also found in our study.The GIIP [16] polymerases detected in our study harboured many mutations in the part of ORF1 encoding non-structural proteins (NS1/2, NS3-NS6) other than the polymerase.Some of these mutations (like amino acid residue no.52, 53 and 76 of NS1/2) were similar to those reported in other GII[P16] polymerases from around the world [25].
Since 2002, new GII.4 variants replacing previous dominant GII. 4 S3) [41][42][43].Compared to the GII.4c construct included in the bivalent TAK-214 norovirus vaccine, the east-Indian GII.4 Fig. 8 Sequence similarity of GII.17 NICED-BCH-11889 strain.The horizontal axis represents the nucleotide positions and the vertical axis represents the sequence similarity compared to the GII.17 NICED-BCH-11889 (LC769700.1)strain.The relative position of different ORFs of norovirus and polymerase-capsid junction was shown above the similarity plot capsids had 6-7 amino acid differences in epitope A, 2 in epitope D and 5 in epitope G. Further serology studies are necessary to understand the degree of immunogenicity that may be conferred by this vaccine against the currently circulating norovirus strains.
Several distinct variants of GII.17 have evolved since 1976, with alterations in the surface exposed loops of the P2 subdomain of VP1 capsid protein [29,52].However, the GII.17 (NICED-BCH-11889) strain detected in our study is distinct from other GII.17 variants with mutations in A, B, P and T loops, particularly a four amino deletions in the P-loop on the protein's outer surface, that might have altered its binding ability with HBGA.

Conclusion
We determined and analyzed 35 full genome norovirus sequences from stool specimens collected between October 2017 to July 2021 from hospitalized children in two hospitals in Kolkata, India.Comprehensive molecular analyses demonstrated that a large variety of different genotypes, similar to those that have been reported globally providing the start of baseline information on which strains a future norovirus vaccine should protect against to reduce the burden of moderate to severe norovirus gastroenteritis in eastern India.

Fig. 1
Fig. 1 Maximum likelihood phylogenetic analysis of full genome sequences obtained from circulating GII noroviruses in eastern India.GII.4[P16] strains are marked in maroon square, GII.3[P16] are marked in sky blue circle, GII.13[P16] are marked in orange diamond, GII.16[P16] are in blue triangle, GII.4[P31] are in green square and GII.17[P-ut] are shown in red triangle.Bootstrap values < 70% are not shown.(*Partial sequences of genomic regions of noroviruses were reported previously and available in the Genbank database)

Fig. 2
Fig. 2 Maximum likelihood phylogenetic analysis of nearly complete genome sequences obtained from circulating GI noroviruses in eastern India.GI.3[P13] strains are marked in black circle.Bootstrap values < 70% are not shown

Fig. 4
Fig. 4 Analysis of sub-genomic transcription start region at ORF1/ORF2 junction of different norovirus strains circulated in eastern India; (a) GII.3[P16],(b) GII.13[P16], (c) GII.4 Sydney[P16] and (d) GII.4 Sydney[P31].Predicted RNA secondary structure with least ΔG are shown in the diagram.For simplicity only one representative strain is shown in the diagram and other strains showing similar structure are listed below the diagram.The predicted secondary structures are shown in antisense orientation, sub-genomic RNA initiation sites are indicated by an arrow and the start anticodons are indicated in black box

Fig. 5 Table 1
Fig. 5 Analysis of variation in amino acid sequence of antigenic epitope regions of VP1 capsid among GII.4 Sydney strains detected in Kolkata, 2017-2021 variants have emerged every 2-3 year until GII.4 Sydney emerged in 2012 and this GII.4 variant is till dominant in 2023.Out of the 5 antigenic epitope regions mapped on the capsid protein of the circulating GII.4 viruses, Epitope A is the most dynamic, conferring immune-elusive property to the virus.Though more similar in amino acid composition to the GII.4 Sydney 2012 capsid, heterogeneity among endemically circulating strains was evident at positions 297 and 372 of epitope A. Residues 297 and 372 residing close to each other on the capsid region have been found to be co-evolutionary in generating antigenic variants of GII.4 capsids [40].Seven of our GII.4strains had histidine and asparagine at positions 297 and 372 similar to GII.4 Sydney 2012 while other GII.4 capsids such as Grimsby 1995 and Farmington Hills 2002 GII.4 variants retained arginine and aspartic acid at these positions.Among the GII.4 Sydney [P16] strains, some had valine substituted for alanine at position 377 which was unique to the strains in our study.The GII.4 Sydney capsids with [P31] polymerases differed from most of the GII.4 strains (endemic GII[P16] and established antigenic variants of GII.4) at residue 340 of epitope C (glycine in place of threonine), residue 393 (aspartic acid instead of alanine/ serine) and 395 (threonine in place of alanine) of epitope D and residue 356 (asparagine in place of alanine) and 359 (threonine in place of alanine) at epitope G.These substitutions characterised the GII.4 capsids of both the [P31] polymerases detected in this study.The norovirus vaccine TAK-214 is a bivalent vaccine comprised of GI.1 and GII.4c component.The GII.4c component is a consensus sequence of three different GII.4 genotype variants-Yerseke_2006a(ABL74391.1),Den Haag_2006b(ABL74395.1) and Houston_2002 (ABY27560.1)(Additional file 8: Table

Fig. 9
Fig. 9 Amino acid sequence alignment of P domain from different GII.17 clusters.HBGA surface binding loops (A-, B-, P-, S-, T-, U-and N-) are heighted by different colours.The consecutive four amino acid deletions in the P-loop are marked by black arrows